Effective Implementation of DGEMM on Modern Multicore CPU
Authors
Abstract
In this paper we present a detailed study on tuning double-precision matrix-matrix multiplication (DGEMM) on the Intel Xeon E5-2680 CPU. We selected an optimal algorithm from the instruction set perspective, as well as software tools optimized for Intel Advanced Vector Extensions (AVX). Our optimizations included the use of vector memory operations and AVX instructions. Our proposed algorithm achieves a performance improvement of 33% compared to the latest results achieved using the Intel Math Kernel Library DGEMM subroutine.
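To make the idea of vector-width accumulation concrete, here is a minimal sketch in portable C. It is not the kernel from the paper: the function names `dgemm_naive` and `dgemm_vec4` are hypothetical, the matrices are assumed to be square and row-major, and no AVX intrinsics are used. The inner loop simply works on four doubles at a time, which is the number of doubles that fit in one 256-bit AVX register, so a compiler may map it to vector loads and multiplies.

```c
#include <stddef.h>

/* Reference kernel: C += A * B for row-major n x n matrices.
   Hypothetical helper for comparison, not the paper's code. */
void dgemm_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}

/* Same product, computing four adjacent elements of a row of C at once.
   Four doubles is the width of one 256-bit AVX register (assumed target). */
void dgemm_vec4(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i) {
        size_t j = 0;
        for (; j + 4 <= n; j += 4) {
            /* Accumulators stay in registers for the whole k loop. */
            double acc0 = c[i * n + j];
            double acc1 = c[i * n + j + 1];
            double acc2 = c[i * n + j + 2];
            double acc3 = c[i * n + j + 3];
            for (size_t k = 0; k < n; ++k) {
                double aik = a[i * n + k];       /* scalar reused 4 times */
                const double *bk = &b[k * n + j]; /* 4 contiguous doubles  */
                acc0 += aik * bk[0];
                acc1 += aik * bk[1];
                acc2 += aik * bk[2];
                acc3 += aik * bk[3];
            }
            c[i * n + j]     = acc0;
            c[i * n + j + 1] = acc1;
            c[i * n + j + 2] = acc2;
            c[i * n + j + 3] = acc3;
        }
        /* Remainder columns when n is not a multiple of 4. */
        for (; j < n; ++j) {
            double acc = c[i * n + j];
            for (size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    }
}
```

Both functions accumulate into `c`, so the caller initializes `c` beforehand (to zero for a plain product). A tuned kernel would add cache blocking and explicit intrinsics on top of this pattern; those details are outside what the abstract states.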
Similar resources
Evaluation of DGEMM Implementation on Intel Xeon Phi Coprocessor
In this paper we present a detailed study of implementing double-precision matrix-matrix multiplication (DGEMM) on the Intel Xeon Phi Coprocessor. We discuss a DGEMM algorithm implementation running "natively" on the coprocessor, minimizing communication with the host CPU. We run DGEMM across a range of matrix sizes natively as well as using the Intel Math Kernel Library. Our optimiza...
Accurate CPU Power Modeling for Multicore Smartphones
CPU is a major source of power consumption in smartphones. Power modeling is a key technology to understand CPU power consumption and also an important tool for power management on smartphones. However, we have found that existing CPU power models on smartphones are ill-suited for modern multicore CPUs: they can give high estimation errors (up to 34%) and high estimation accuracy variation (mor...
Is Cache Oblivious DGEMM a Viable Alternative?
We present an in-depth study of various implementations of DGEMM, using both the recursive and iterative programming styles. Recursive algorithms for DGEMM are usually cache-oblivious and they automatically block DGEMM’s operands A, B, C for the memory hierarchy. Iterative algorithms for DGEMM explicitly block A, B, C for the L1 cache, higher caches and memory. Our study shows that recursive DG...
A Fast GEMM Implementation On a Cypress GPU
We present benchmark results of optimized dense matrix multiplication kernels for the Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show ∼ 2 Tflop/s and ∼ 470 Gflop/s, respectively. These results for SP and DP correspond to 73% and 87% of the theoretical performance of the GPU, respectively. Cu...
Structured Orthogonal Inversion of Block p-Cyclic Matrices on Multicore with GPU
We present a block structured orthogonal factorization (BSOF) algorithm and its parallelization for computing the inversion of block p-cyclic matrices. We aim at high performance on multicore systems with GPU accelerators. We provide a quantitative performance model for optimal host-device load balance, and validate the model through numerical tests. Benchmarking results show that the parallel BSOF...